[receiver/awslambda] Add multi-format S3 log routing #47237

Draft
MichaelKatsoulis wants to merge 14 commits into open-telemetry:main from MichaelKatsoulis:feature/awslambdareceiver-s3-multi-format

Conversation

@MichaelKatsoulis
Contributor

Description

This PR adds support for routing S3 objects to different encoding extensions based on their key prefix within a
single Lambda deployment.
This is useful when a Lambda receives events from S3 buckets that
store multiple log types (e.g. VPC Flow Logs and CloudTrail in the same bucket, or across
multiple buckets with different log types).

We introduce a new encodings field in the S3 receiver config that can be used like this:

extensions:
  awslogs_encoding/vpcflow:
    format: vpcflow
    vpcflow:
      file_format: plain-text
  awslogs_encoding/cloudtrail:
    format: cloudtrail

receivers:
  awslambda:
    s3:
      encodings:
        - name: vpcflow
          encoding: awslogs_encoding/vpcflow     # decode VPC Flow Log fields into structured records
        - name: cloudtrail
          encoding: awslogs_encoding/cloudtrail  # decode CloudTrail JSON events into structured records
          path_pattern: "myorg/*/CloudTrail"     # optional: override default (AWSLogs/*/CloudTrail); omit to use the default
        - name: catchall
          path_pattern: "*"                      # forward anything else as raw bytes

The existing encoding field is unchanged. encoding and encodings are mutually
exclusive.
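As an illustration, the mutual-exclusivity check could be sketched as below. The type and field names (`s3Config`, `encodingConfig`, `validate`) are assumptions mirroring the YAML above, not the PR's actual code:

```go
package main

import (
	"errors"
	"fmt"
)

// s3Config is a hypothetical sketch of the relevant receiver config;
// field names mirror the YAML example, not the PR's implementation.
type s3Config struct {
	Encoding  string           // legacy single-encoding field
	Encodings []encodingConfig // new multi-format routing field
}

type encodingConfig struct {
	Name        string
	Encoding    string
	PathPattern string
}

// validate enforces that encoding and encodings are mutually exclusive.
func (c *s3Config) validate() error {
	if c.Encoding != "" && len(c.Encodings) > 0 {
		return errors.New("encoding and encodings are mutually exclusive")
	}
	return nil
}

func main() {
	cfg := &s3Config{
		Encoding:  "awslogs_encoding/vpcflow",
		Encodings: []encodingConfig{{Name: "cloudtrail", Encoding: "awslogs_encoding/cloudtrail"}},
	}
	fmt.Println(cfg.validate()) // both fields set: validation error
}
```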

Link to tracking issue

Part of #46458

Testing

Unit Testing.
Pending E2E testing.

Documentation

Readme has been updated.

@MichaelKatsoulis
Contributor Author

@axw This is the first implementation of multi-format log support in the awslambda receiver.
It focuses only on logs stored in S3.

One thing I am not sure about is the multiFormatS3LogsHandler, which lives alongside the pre-existing S3LogsHandler.
When we add multi-format support for logs coming from CloudWatch, we could perhaps create a single routedLogsHandler (or something similar) to cover both the S3 and CloudWatch cases.

Comment on lines +33 to +34
patternParts := strings.Split(pattern, "/")
targetParts := strings.Split(target, "/")
Contributor


We'll be doing a lot of pattern matching; should we optimise this?

  • The pattern can be split at receiver construction time, so we only create a slice once per pattern
  • The object key can be split once per S3 event, rather than once per encoding pattern
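A minimal sketch of that optimisation (hypothetical names; `encodingPattern` and `match` are not the PR's API): the pattern is split once at construction time, and callers split the object key once per S3 event and reuse the resulting slice for every pattern comparison:

```go
package main

import (
	"fmt"
	"strings"
)

// encodingPattern holds a pattern pre-split into path segments, so the
// split happens once when the router is built, not on every lookup.
type encodingPattern struct {
	name     string
	segments []string // e.g. ["AWSLogs", "*", "CloudTrail"]
}

func newEncodingPattern(name, pattern string) encodingPattern {
	return encodingPattern{name: name, segments: strings.Split(pattern, "/")}
}

// match compares pre-split pattern segments against pre-split key
// segments as a prefix; "*" matches any single segment.
func (p encodingPattern) match(keyParts []string) bool {
	if len(p.segments) > len(keyParts) {
		return false
	}
	for i, seg := range p.segments {
		if seg != "*" && seg != keyParts[i] {
			return false
		}
	}
	return true
}

func main() {
	pat := newEncodingPattern("cloudtrail", "AWSLogs/*/CloudTrail")
	// Split the object key once per event, then reuse keyParts.
	keyParts := strings.Split("AWSLogs/123456789012/CloudTrail/us-east-1/file.json.gz", "/")
	fmt.Println(pat.match(keyParts)) // prints "true"
}
```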

}
if strings.Contains(p, "*") {
// Glob matching within a segment (e.g. "eni-*", "*_CloudTrail_*").
matched, _ := path.Match(p, targetParts[i])
Contributor


This will also handle "?", character classes, etc. Do we want that?

If yes, we should document it. If not, we should probably create a variation of path.Match that only handles *. I think we'd be better off simplifying.

// matchPrefixWithWildcard("eni-0abc123-all", "eni-*") => true
// matchPrefixWithWildcard("123_CloudTrail_us-east-1", "*_CloudTrail_*") => true
// matchPrefixWithWildcard("any/path", "*") => true
func matchPrefixWithWildcard(target, pattern string) bool {
Contributor


Would we be better off supporting only a single * or an exact match for S3 path parts? It seems unnecessary to support glob matching in the middle of a path part.

That would simplify the matching: no need for path.Match, just an exact match or accept-anything. I think that makes things easier to reason about. We could also optimise the code further by using a trie (prefix tree), but that's probably overkill given how much time will be spent in matching vs. I/O and parsing.

If we did this, then I think we'll end up with two flavours of patterns:

  • For S3 object keys: path matching with either exact path part match or full wildcard part match.
  • For CloudWatch log group and stream names: prefix and/or suffix wildcard only (*_foo_*, but not foo*bar)

WDYT?
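If we went that way, the two flavours could be sketched roughly as below. The function names (`matchS3Segment`, `matchCWName`) are illustrative, not the PR's code:

```go
package main

import (
	"fmt"
	"strings"
)

// matchS3Segment: an S3 path part either matches exactly or is the
// full wildcard "*" -- no glob matching within a segment.
func matchS3Segment(pattern, part string) bool {
	return pattern == "*" || pattern == part
}

// matchCWName: for CloudWatch log group/stream names, only a leading
// and/or trailing "*" is supported (e.g. "*_CloudTrail_*"); a wildcard
// in the middle, like "foo*bar", is treated as a literal and won't match.
func matchCWName(pattern, name string) bool {
	prefixWild := strings.HasPrefix(pattern, "*")
	suffixWild := strings.HasSuffix(pattern, "*")
	core := strings.Trim(pattern, "*")
	switch {
	case prefixWild && suffixWild:
		return strings.Contains(name, core)
	case prefixWild:
		return strings.HasSuffix(name, core)
	case suffixWild:
		return strings.HasPrefix(name, core)
	default:
		return name == pattern
	}
}

func main() {
	fmt.Println(matchS3Segment("*", "123456789012"))                      // true
	fmt.Println(matchCWName("*_CloudTrail_*", "123_CloudTrail_us-east-1")) // true
	fmt.Println(matchCWName("foo*", "foobar"))                            // true
}
```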

Contributor Author


I agree with all the simplification suggestions.
I was initially thinking "why not support that too?", but that seems to be overkill and complicates things.

@MichaelKatsoulis
Contributor Author

@axw after doing these two main things:

  • Pattern split moved to construction time. strings.Split(pattern, "/") now runs once per encoding when the router is built, not on every GetDecoder call.
  • Object key split moved outside the pattern loop. strings.Split(objectKey, "/") now runs once at the top of GetDecoder, then the same targetParts slice is reused for every pattern comparison.

we can see clear performance gains.

Clear difference:

| Approach | ns/op | B/op | allocs/op |
|---|---|---|---|
| Split every call (old) | 1786 | 1584 | 18 |
| Pre-split patterns + split key once (new) | 382 | 384 | 3 |
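The shape of that comparison can be reproduced with a self-contained sketch. `matchParts`, `compareAllocs`, and the pattern set are illustrative, not the PR's code; `testing.Benchmark` lets us measure allocations from an ordinary program:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// matchParts checks pre-split pattern segments against pre-split key
// segments as a prefix; "*" matches any single segment (illustrative helper).
func matchParts(pattern, key []string) bool {
	if len(pattern) > len(key) {
		return false
	}
	for i, seg := range pattern {
		if seg != "*" && seg != key[i] {
			return false
		}
	}
	return true
}

// compareAllocs benchmarks the old approach (split on every call)
// against the new one (pre-split patterns, split the key once) and
// returns allocations per operation for each.
func compareAllocs() (oldAllocs, newAllocs int64) {
	patterns := []string{"AWSLogs/*/CloudTrail", "AWSLogs/*/vpcflowlogs"}
	key := "AWSLogs/123456789012/CloudTrail/us-east-1/file.json.gz"

	old := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			for _, p := range patterns {
				// Old: re-split both pattern and key on every comparison.
				_ = matchParts(strings.Split(p, "/"), strings.Split(key, "/"))
			}
		}
	})

	// New: split each pattern once, at "construction" time.
	preSplit := make([][]string, len(patterns))
	for i, p := range patterns {
		preSplit[i] = strings.Split(p, "/")
	}
	optimized := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			// Split the key once per event, reuse for every pattern.
			keyParts := strings.Split(key, "/")
			for _, pp := range preSplit {
				_ = matchParts(pp, keyParts)
			}
		}
	})
	return old.AllocsPerOp(), optimized.AllocsPerOp()
}

func main() {
	o, n := compareAllocs()
	fmt.Printf("old: %d allocs/op, new: %d allocs/op\n", o, n)
}
```

The exact numbers will differ from the table above (that benchmark exercises the real router), but the direction of the difference is the same.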
